NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees

Nie, F; Hou, X; Lin, S; Zou, J; Yao, H; Zhang, L (July 2025, Proceedings of Machine Learning Research)

The propensity of large language models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FACTTEST, a novel framework that statistically assesses whether an LLM can provide correct answers to given questions with high-probability correctness guarantees. We formulate hallucina- tion detection as a hypothesis testing problem to enforce an upper bound of Type I errors at user-specified significance levels. Notably, we prove that FACTTEST also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. FACTTEST is distribution-free and and model-agnostic. It works for any number of human-annotated samples and applies to any black-box or white-box LM. Extensive experiments demonstrate that FACTTEST effectively detects hallucinations and enable LLMs to abstain from answering unknown questions, leading to an over 40% accuracy improvement.
more » « less
Free, publicly-accessible full text available July 17, 2026
SAFREE: Training-free and Adaptive Guard for Safe Text-to-image and Video Generation

Yoon, J; Yu, S; Patil, V; Yao, H; Bansal, M (April 2025, Proceedings of the International Conference on Learning Representations)

Free, publicly-accessible full text available April 24, 2026
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Xia, P; Zhu, K; Li, H; Wang, T; Shi, W; Wang, S; Zhang, L; Zou, J; Yao, H (April 2025, ICLR)

Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general to different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in factual accuracy in the factual accuracy of Med-LVLMs.
more » « less
Free, publicly-accessible full text available April 24, 2026
Seasonal synchronicity and multi-decadal stability of headwater biogeochemistry in the northern temperate zone

https://doi.org/10.1007/s10533-025-01263-2

Harms, Tamara_K; Hood, Jim; Scheuerell, Mark_D; Creed, Irena; Campbell, John_L; Fernandez, I.; Higgins, S_N; Johnson, Sherri_L; Shanley, James_B; Sebestyen, Stephen; et al (September 2025, Biogeochemistry)

Abstract Temporal patterns in chemistry of headwater streams reflect responses of water and elemental cycles to perturbations occurring at local to global scales. We evaluated multi-scale temporal patterns in up to 32 y of monthly observations of stream chemistry (ammonium, calcium, dissolved organic carbon, nitrate, total dissolved phosphorus, and sulfate) in 22 reference catchments within the northern temperate zone of North America. Multivariate autoregressive state-space (MARSS) models were applied to quantify patterns at multi-decadal, seasonal, and shorter intervals during a period that encompassed warming climate, seasonal changes in precipitation, and regional declines in atmospheric deposition. Significant long-term trends in solute concentrations within a subset of the catchments were consistent with recovery from atmospheric deposition (e.g., calcium, nitrate, sulfate) and increased precipitation (e.g., dissolved organic carbon). Lack of evidence for multi-decadal trends in most catchments suggests resilience of northern temperate ecosystems or that subtle net effects of simultaneous changes in climate and disturbance regimes do not result in directional trends. Synchronous seasonal oscillations of solute concentrations occurred across many catchments, reflecting shared climate and biotic drivers of seasonality within the northern temperate zone. Despite shared patterns among catchments at a seasonal scale, multi-scale temporal patterns were statistically distinct among even adjacent headwater catchments, implying that local attributes of headwater catchments modify the signals imparted by atmospheric phenomena and regional disturbances. To effectively characterize hydrologic and biogeochemical responses to changing climate and disturbance regimes, catchment monitoring programs could include multiple streams with contributing areas that encompass regional heterogeneity in vegetation, topography, and elevation. Overall, detection of long-term patterns and trends requires monitoring multiple catchments at a frequency that captures periodic variation (e.g., seasonality) and a duration encompassing the perturbations of interest.
more » « less
S2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Yang, X; Leng, J; Guo, G; Zhao, J; Nakada, R; Zhang, L; Yao, H; Chen, B (December 2024, NeurIPS)

Current PEFT methods for LLMs can achieve either high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Utilizing this key insight, we propose a family of \underline{S}tructured \underline{S}parse \underline{F}ine-\underline{T}uning (\textbf{\model}) methods for LLMs, which \textit{concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability}. \model \mbox{accomplishes this by ``selecting sparsely and computing densely". It selects a few} heads and channels in the MHA and FFN modules for each Transformer block, respectively. Next, it co-permutes weight matrices on both sides of the coupled structures in LLMs to connect the selected components in each layer into a dense submatrix. Finally, \model performs in-place gradient updates on all submatrices. Through theoretical analysis and empirical results, our method prevents overfitting and forgetting, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6$$\%$$ and 1.3$$\%$$ average improvements compared to LoRA, and surpasses full FT by 11.5$$\%$$ when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, \model saves training memory up to 3$$\times$$ and improves latency by 1.5-2.7$$\times$$ compared to full FT, while delivering an average 10\% improvement over LoRA on both metrics. We further demonstrate that the weight updates in \model can be decoupled into adapters, enabling effective fusion, fast switch, and efficient parallelism for serving multiple fine-tuned models.
more » « less
Full Text Available
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

Xia, P; Zhu, K; Li, H; Zhu, H; Li, Y; Li, G; Zhang, L; Yao, H (November 2024, EMNLP)

The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, limited retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references, interfering with the model’s generation. Second, in cases where the model originally responds correctly, applying RAG can lead to an over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RAFE on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy.
more » « less
Full Text Available
Calibrated Self-Rewarding Vision Language Models

Zhou, Y; Fan, Z; Cheng, D; Yang, S; Chen, Z; Cui, C; Wang, X; Li, Y; Zhang, L; Yao, H (December 2024, NeurIPS)

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches are resource-intensive and may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR significantly enhances performance and reduces hallucinations across twelve benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning.
more » « less
Full Text Available
Forward and Inverse Energy Cascade in Fluid Turbulence Adhere to Kolmogorov’s Refined Similarity Hypothesis

https://doi.org/10.1103/PhysRevLett.132.164001

Yao, H.; Yeung, P. K.; Zaki, T. A.; Meneveau, C. (April 2024, Physical Review Letters)
Meta-Learning with Neural Bandit Scheduler

Qi, Y; Ban, Y; Wei, T; Zou, J; Yao, H; He, J (December 2023, NeurIPS 2023)

Full Text Available
Sequential change‐point detection: Computation versus statistical performance

https://doi.org/10.1002/wics.1628

Wang, Haoyun; Xie, Yao (July 2023, WIREs Computational Statistics)

Abstract Change‐point detection studies the problem of detecting the changes in the underlying distribution of the data stream as soon as possible after the change happens. Modern large‐scale, high‐dimensional, and complex streaming data call for computationally (memory) efficient sequential change‐point detection algorithms that are also statistically powerful. This gives rise to a computation versus statistical power trade‐off, an aspect less emphasized in the past in classic literature. This tutorial takes this new perspective and reviews several sequential change‐point detection procedures, ranging from classic sequential change‐point detection algorithms to more recent non‐parametric procedures that consider computation, memory efficiency, and model robustness in the algorithm design. Our survey also contains classic performance analysis, which provides useful techniques for analyzing new procedures. This article is categorized under:Statistical Models > Time Series ModelsAlgorithms and Computational Methods > AlgorithmsData: Types and Structure > Time Series, Stochastic Processes, and Functional Data
more » « less

« Prev Next »

Search for: All records